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Markov Chain Monte Carlo: Can We 
Trust the Third Significant Figure? 

James M. Flegal, Murali Haran and Galin L. Jones 



Abstract. Current reporting of results based on Markov chain Monte 
Carlo computations could be improved. In particular, a measure of the 
accuracy of the resulting estimates is rarely reported. Thus we have 
little ability to objectively assess the quality of the reported estimates. 
We address this issue in that we discuss why Monte Carlo standard 
errors are important, how they can be easily calculated in Markov chain 
Monte Carlo and how they can be used to decide when to stop the 
simulation. We compare their use to a popular alternative in the context 
of two examples. 

Key words and phrases: Convergence diagnostic, Markov chain, Monte 
Carlo, standard errors. 
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1. INTRODUCTION 

Hoaglin and Andrews (1975) consider the general 
problem of what information should be included in 
publishing computation-based results. The goal of 
their suggestions was "...to make it easy for the 
reader to make reasonable assessments of the nu- 
merical quality of the results." In particular, Hoaglin 
and Andrews suggested that it is crucial to report 
some notion of the accuracy of the results and, for 
Monte Carlo studies this should include estimated 
standard errors. However, in settings where Markov 
chain Monte Carlo (MCMC) is used there is a cul- 
ture of rarely reporting such information. For ex- 
ample, we looked at the issues published in 2006 
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of Journal of the American Statistical Association, 
Biometrika and Journal of the Royal Statistical So- 
ciety, Series B. In these journals we found 39 pa- 
pers that used MCMC. Only three of them directly 
addressed the Monte Carlo error in the reported es- 
timates. Thus it is apparent that the readers of the 
other papers have little ability to objectively assess 
the quality of the reported estimates. We attempt 
to address this issue in that we discuss why Monte 
Carlo standard errors are important, how they can 
be easily calculated in MCMC settings and compare 
their use to a popular alternative. 

Simply put, MCMC is a method for using a com- 
puter to generate data and subsequently using stan- 
dard large sample statistical methods to estimate 
fixed, unknown quantities of a given target distribu- 
tion. (Thus, we object to calling it "Bayesian Com- 
putation.") That is, it is used to produce a point 
estimate of some characteristic of a target distribu- 
tion TT having support X. The most common use of 
MCMC is to estimate E^^g := Jy^g{x)7r{dx) where g 
is a real- valued, vr-integrable function on X. 

Suppose that X = {Xi, X2, X3, . . .} is an aperi- 
odic, irreducible, positive Harris recurrent Markov 
chain with state space X and invariant distribution 
TT (for definitions see Meyn and Tweedie (1993)). In 
this case X is Harris ergodic. Typically, estimating 
Ej^g is natural since an appeal to the Ergodic Theo- 
rem implies that if E^^lgl < 00, then, with probabil- 
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ity 1, 

1 " 

(1) gn:= -^g{Xi) ^ E^g as n ^ oo. 

1=1 

The MCMC method entails constructing a Markov 
chain X satisfying the regularity conditions described 
above and then simulating X for a finite number of 
steps, say n, and using g^ to estimate ET^g. The 
popularity of MCMC largely is due to the ease with 
which such an X can be simulated 
(Chen, Shao and Ibrahim (2000), Robert and Casella 
(1999), Liu (2001)). 

An obvious question is when should we stop the 
simulation? That is, how large should n be? Or, 
when is gn a good estimate of E-^^g? In a given ap- 
plication we usually have an idea about how many 
significant figures we want in our estimate, but how 
should this be assessed? Responsible statisticians 
and scientists want to do the right thing but out- 
put analysis in MCMC has become a muddled area 
with often conflicting advice and dubious terminol- 
ogy. This leaves many in a position where they feel 
forced to rely on intuition, folklore and heuristics. 
We believe this often leads to some poor practices: 
(A) stopping the simulation too early, (B) wasting 
potentially useful samples, and, most importantly, 
(C) providing no notion of the quality of gn as an 
estimate of Ej^g. In this paper we focus on issue (C) 
but touch briefly on (A) and (B). 

The rest of this paper is organized as follows. In 
Section 2 we briefly introduce some basic concepts 
from the theory of Markov chains. In Section 3 we 
consider estimating the Monte Carlo error of gn- 
Then Section 4 covers two methods for stopping the 
simulation and compares them in a toy example. In 
Section 5 the two methods are compared again in a 
realistic spatial model for a data set on wheat crop 
flowering dates in North Dakota. We close with some 
final remarks in Section 6. 

2. MARKOV CHAIN BASICS 

Suppose that X = {Xi,X2, . . .} is a Harris ergodic 
Markov chain with state space X and invariant dis- 
tribution vr. Forn e N := {1, 2,3,.. .} let P"(x, •) be 
the n-step Markov transition kernel; that is, for x £ 
X and a measurable set A, P"(x, A) = Fr{Xn+i G A \ 
Xi = x). An extremely useful property of X is that 
the chain will converge to the invariant distribution. 
Specifically, 

||P"(x,-) -vr(-)|UO asn^oo. 



where the left-hand side is the total variation dis- 
tance between P^(x,-) and 7r(-). (This is stronger 
than convergence in distribution.) The Markov chain 
X is geometrically ergodic if there exists a constant 
< t < 1 and a function M : X ^ M+ such that for 
any x £X, 

(2) ||P"(x,-)-^(-)ll<M(x)r 

for n S N. If M(x) is bounded, then X is uniformly 
ergodic. Thus uniform ergodicity implies geometric 
ergodicity. However, as one might imagine, finding 
M and t directly is often quite difficult in realistic 
settings. 

There has been a substantial amount of effort de- 
voted to establishing (2) in MCMC settings. For ex- 
ample, Hobert and Geyer (1998), Johnson and Jones 
(2008), Jones and Hobert (2004), Marchev and Hobert 
(2004), Mira and Tierney (2002), Robert (1995), 
Roberts and Poison (1994), Roberts and Rosenthal 
(1999), Rosenthal (1995, 1996), Roy and Hobert (2007) 
and Tierney (1994) examined Gibbs samplers while 
Christensen, Moller and Waagepetersen (2001), Douc 
et al. (2004), Fort and Moulines (2000, 2003), Geyer 
(1999), Jarner and Hansen (2000), Jarner and Roberts 
(2002), Meyn and Tweedie (1994) and Mengersen 
and Tweedie (1996) considered Metropolis-Hastings 
algorithms. 

3. MONTE CARLO ERROR 

A Monte Carlo approximation is not exact. The 
number gn is not the exact value of the integral we 
are trying to approximate. It is off by some amount, 
the Monte Carlo error, gn — E^^g. How large is the 
Monte Carlo error? Unfortunately, we can never know 
unless we know Ej^g. 

We do not know the Monte Carlo error, but we 
can get a handle on its sampling distribution. That 
is, assessing the Monte Carlo error can be accom- 
plished by estimating the variance of the asymptotic 
distribution of gn- Under regularity conditions, the 
Markov chain X and function g will admit a CLT. 
That is, 

(3) V^{gn-E^g)^NiO,a^g) 

as n ^ oo where cr^ := yarT^{g{Xi)} + 
'^J2iZ2'^^^TT{9(.Xi), g{Xi)}; the subscript ir means 
that the expectations are calculated assuming Xi ~ 
vr. The CLT holds for any initial distribution when 
either (i) X is geometrically ergodic and E^j-lg]'^'^^ < 
oo for some 5 > or (ii) X is uniformly ergodic and 
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EttQ^ < oo. These are not the only sufficient condi- 
tions for a CLT but are among the most straightfor- 
ward to state; the interested reader is pointed to the 
summaries provided by Jones (2004) and 
Roberts and Rosenthal (2004). 

Given a CLT we can assess the Monte Carlo error 
in Qn by estimating the variance, o"^. That is, we can 
calculate and report an estimate of ag , say o"^ , that 
will allow us to assess the accuracy of the point esti- 
mate. There have been many techniques introduced 
for estimating o"^; see, among others, 
Bratley, Fox and Schrage (1987), Fishman (1996), 
Geyer (1992), Glynn and Iglehart (1990), Glynn and 
Whitt (1991), Mykland, Tierney and Yu (1995) and 
Roberts (1996). For example, regenerative simula- 
tion, batch means and spectral variance estimators 
all can be appropriate in MCMC settings. We will 
consider only one of the available methods, namely 
non-overlapping batch means. We chose this method 
because it is easy to implement and can enjoy de- 
sirable theoretical properties. However, overlapping 
batch means has a reputation of sometimes being 
more efficient than nonoverlapping batch means. On 
the other hand, currently the spectral variance and 
overlapping batch means estimators require stronger 
regularity conditions than nonoverlapping batch 
means. 

3.1 Batch Means 

In nonoverlapping batch means the output is bro- 
ken into blocks of equal size. Suppose the algorithm 
is run for a total oi n = ah iterations (hence a = Un 
and b = bn are implicit functions of n) and define 

1 

Yj :=- g{Xi) for j = 1, . . . , a. 

i=(j-l)6+l 

The batch means estimate of 0"^ is 

(4) -l=^,tiy^--9n?- 

Batch means is attractive because it is easy to im- 
plement (and it is available in some software, e.g., 
WinBUGS) but some authors encourage caution in 
its use (Roberts (1996)). In particular, we believe 
careful use is warranted since (4), in general, is not 
a consistent estimator of cr^. On the other hand, 
Jones et al. (2006) showed that if the batch size and 
the number of batches are allowed to increase as the 
overall length of the simulation increases by setting 



bn = ["-^J ^'^^ o,n = [n/bn\ , then with prob- 

ability 1 as n — > oo . In this case we call it consistent 
batch means (CBM) to distinguish it from the stan- 
dard (fixed number of batches) version. The regu- 
larity conditions require that X be geometrically er- 
godic, £',r|<7p^^^^'^^ < oo for some ei > 0, £2 > and 
(1 + ei/2)-i <e<l; often 9 = 1/2 (i.e., bn = ly/n\ 
and a„ = [n/bn\) is a convenient choice that works 
well in applications. Note that the only practical dif- 
ference between CBM and standard batch means is 
that the batch number and size are chosen as func- 
tions of the overall run length, n. A simple R func- 
tion for implementing CBM or a faster command 
line C version of this function is available from the 
authors upon request. 

Using CBM to get an estimate of the Monte Carlo 
standard error (MCSE) of gn, say dgj^fn, we can 
form an asymptotically valid confidence interval for 
Et^q. The half- width of the interval is given by 



(5) 




where ta„-i is an appropriate quantile from Stu- 
dent's t distribution with Cn — l degrees of freedom. 

3.2 How Many Significant Figures? 

The title of the paper contains a rhetorical ques- 
tion; we do not always care about the third signif- 
icant figure. But we should care about how many 
significant figures there are in our estimates. As- 
sessing the Monte Carlo error through (5) gives us 
a tool to do this. For example, suppose gn = 0.02; 
then there is exactly one significant figure in the es- 
timate, namely the "2," but how confident are we 
about it? Letting h^, denote the half-width given in 

(5) of a (1 — a)100% interval, we would trust the 
one significant figure in our estimate if 0.02 it /iq C 
[0.015,0.025) since otherwise values such as Ej^g = 
0.01 or i^Trff = 0.03 are plausible through rounding. 

More generally, we can use (5) to assess how many 
significant figures we have in our estimates. This is 
illustrated in the following toy example that will be 
used several times throughout the rest of this paper. 

3.2.1 Toy example. Let Yi, . . . , Y^- be i.i.d. N(/i, A) 
and let the prior for (/i. A) be proportional to 1/\/A- 
The posterior density is characterized by 

(6) ^(^,A|y)ocA"(^+i)/2exp|-i^f;(y,-/.)2| 

where y = (yi, . . . , yx)'^ ■ R is easy to check that this 
posterior is proper as long as K > 3 and we assume 
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this throughout. Using the Gibbs sampler to make 
draws from (6) requires the full conditional densi- 
ties, /(/i|A, y) and /(A|/i,y), which are as follows: 



HA,y~N(y,A/i^), 

'K-1 {K -l)s'^ + K{y- ^if 



A|/u,y ~IG 



where y is the sample mean and {K — l)s^ = ^{yi~ 
y)^. [We say W ^\G{a,j3) if its density is propor- 
tional to w~^°'~^^'^ e~^/^ I{w > 0).] Consider estimat- 
ing the posterior means of /x and A. It is easy to prove 
that E{fi\y) = y and E{X\y) = {K - l)s'^/{K - 4) for 
K > 4:. Thus we do not need MCMC to estimate 
these quantities but we will ignore this and use the 
output of a Gibbs sampler to estimate E{^\y) and 
E{\\y). 

Consider the Gibbs sampler that updates A then 
that is, letting (A',//') denote the current state 
and (A,/i) denote the future state, the transition 
looks like (A', /x') (A, fi') (A, fi). Jones and Hobert 
(2001) established that the associated Markov chain 
is geometrically ergodic as long as X > 5. If A' > 10, 
then the moment conditions ensuring the CLT and 
the regularity conditions for CBM (with = 1/2) 
hold. 

Let K = 11, y = 1, and {K - l)s^ = 14 so that 
E[ix\y) = 1 and E[\\y) = 2; for the remainder of this 
paper these settings will be used every time we con- 
sider this example. Consider estimating E{^i\y) and 
E[\\y) with Jin and A„, respectively, and using CBM 
to calculate the MCSEs for each estimate. Specifi- 
cally, we will use a 95% confidence level in (5) to 
construct an interval estimate. Let the initial value 
for the simulation be (Ai,/xi) = (1, 1). When we ran 
the Gibbs sampler for 1000 iterations we obtained 
-^1000 = 2.003 with an MCSE of 0.055 and /iiooo = 
0.99 with an MCSE of 0.016. Thus we would be com- 
fortable reporting two significant figures for the esti- 
mates of E{X\y) and E{fj,\y), specifically 2.0 and 1.0, 
respectively. But when we started from (Ai,/xi) = 
(100, 100) and ran Gibbs for 1000 iterations we ob- 
tained Aiooo = 13.06 with an MCSE of 11.01 and 
Aiooo = 1-06 with an MCSE of 0.071. Thus we would 
not be comfortable with any significant figures for 
the estimate of E(X\y) but we would trust one signif- 
icant figure (i.e., 1) for E{fi\y). Unless the MCSE is 
calculated and reported a hypothetical reader would 
have no way to judge this independently. 



3.2.2 Remarks. 

1. A common concern about MCSEs is that their 
use may require estimating E^^g much too pre- 
cisely relative to y^var^. Of course, it would 
be a rare problem indeed where we would know 
^var^ g and not ET^g. Thus we would need to es- 
timate -^var^ g and calculate an MCSE (via the 
delta method) before we could trust the estimate 
of ^varyr g to inform us about the MCSE for Ej^g. 

2. We are not suggesting that all MCMC-based es- 
timates should be reported in terms of significant 
figures; in fact we do not do this in the simula- 
tions that occur later. Instead, we are strongly 
suggesting that an estimate of the Monte Carlo 
standard error should be used to assess simu- 
lation error and reported. Without an attached 
MCSE a point estimate should not be trusted. 

4. STOPPING THE SIMULATION 

In this section we consider two formal approaches 
to terminating the simulation. The first is based on 
calculating an MCSE and is discussed in Section 4.1. 
The second is based on the method introduced in 
Gelman and Rubin (1992) and is one of many so- 
called convergence diagnostics (Cowles and Carlin 
(1996)). Our reason for choosing the Gelman-Rubin 
diagnostic (GRD) is that it appears to be far and 
away the most popular method for stopping the sim- 
ulation. GRD and MCSE are used to stop the simu- 
lation in a similar manner. After n iterations either 
the value of the GRD or MCSE is calculated and 
if it is not sufficiently small then we continue the 
simulation until it is. 

4.1 Fixed- Width Methodology 

Suppose we have an idea of how many significant 
figures we want in our estimate. Another way of say- 
ing this is that we want the half-width of the inter- 
val (5) to be less than some user-specified value, 
e. Thus we might consider stopping the simulation 
when the MCSE of gn is sufficiently small. This, of 
course, means that we may have to check whether 
this criterion is met many times. It is not obvious 
that such a procedure will be guaranteed to termi- 
nate the computation in a finite amount of time or 
whether the resulting intervals will enjoy the desired 
coverage probability and half-width. Also, we do not 
want to check too early in the simulation since we 
will run the risk of premature termination due to a 
poor estimate of the standard error. 
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Suppose we use CBM to estimate the Monte Carlo 
standard error of Qn, say Ogj^fn^ and use it to form 
a confidence interval for E^^g. If this interval is too 
large, then the value of n is increased and simula- 
tion continues until the interval is sufficiently small. 
Formally, the criterion is given by 

(7) i.„_i4^+p(n)<e 

where ta„-i is an appropriate quantile, p(n) = el{n < 
n*) where n* > is fixed, / is the usual indicator 
function on and e > is the user-specified half- 
width. The role of p is to ensure that the simulation 
is not terminated prematurely due to a poor esti- 
mate of (Tg. The conditions which guarantee o"^ is 
consistent also imply that this procedure will termi- 
nate in a finite amount of time with probability 1 
and that the resulting intervals asymptotically have 
the desired coverage (see Glynn and Whitt (1992)). 
However, the finite-sample properties of (5) have re- 
ceived less formal investigation but simulation re- 
sults suggest that the resulting intervals have ap- 
proximately the desired coverage and half-width in 
practice (see Jones et al., 2006). 

4.1.1 Remarks. 

1. The CLT and CBM require a geometrically er- 
godic Markov chain. This can be difficult to check 
directly in any given application. On the other 
hand, considerable effort has been spent estab- 
lishing (2) for a number of Markov chains; see 
the references given at the end of Section 2. In 
our view, this is not the obstacle that it was in 
the past. 

2. The frequency with which (7) should be evalu- 
ated is an open question. Checking often, say ev- 
ery few iterations, may substantially increase the 
overall computational effort. 

3. Consider p[n) = el{n < n*). The choice of n* is 
often made based on the user's experience with 
the problem at hand. However, for geometrically 
ergodic Markov chains there is some theory that 
can give guidance on this issue (see 
Jones and Hobert (2001), Rosenthal (1995)). 

4. Stationarity of the Markov chain is not required 
for the CLT or the strong consistency of CBM. 
One consequence is that burn-in is not required 
if we can find a reasonable starting value. 



4.1.2 Toy example. We consider implementation 
of fixed-width methods in the toy example intro- 
duced in Section 3.2.1. We performed 1000 indepen- 
dent replications of the following procedure. Each 
replication of the Gibbs sampler was started from 
y. Using (7), a replication was terminated when the 
half-width of a 95% interval with p(n) = el{n < 400) 
was smaller than a prespecified cutoff, e, for both 
parameters. If both standard errors were not less 
than the cutoff, then the current chain length was 
increased by 10% before checking again. We used 
two settings for the cutoff, e = 0.06 and e = 0.04. 
These settings wih be denoted CBMl and CBM2, 
respectively. 

First, consider the estimates of E{n\y). We can 
see from Figure 1(a) and (b) that the estimates of 
E{iJ,\y) are centered around the truth with both set- 
tings. Clearly, the cutoff of e = 0.04 is more strin- 
gent and yields estimates that are closer to the true 
value. It should come as no surprise that the cost 
of this added precision is increased computational 
effort; see Table 2. The corresponding plots for 
yield the same results and are therefore excluded. 

Consider CBM2. In this case, 100% of the es- 
timates, fin, of E{^\y) and 96% of the estimates, 
A„, of E{\\y) are within the specified e = 0.04 of 
the truth. In every replication the simulation was 
stopped when the criterion (7) for E{\\y) dropped 
below the cutoff. Similar results hold for the CBMl 
{e = 0.06) setting. 

4.2 The Gelman-Rubin Diagnostic 

The Gelman-Rubin diagnostic (GRD) introduced 
in Gelman and Rubin (1992) and refined by 
Brooks and Gelman (1998) is a popular method for 
assessing the output of MCMC algorithms. It is im- 
portant to note that this method is also based on 
a Markov chain CLT (Gelman and Rubin (1992), 
page 463) and hence does not apply more generally 
than approaches based on calculating an MCSE. 

GRD is based on the simulation of m indepen- 
dent parallel Markov chains having invariant distri- 
bution vr, each of length 21. Thus the total simu- 
lation effort is 2lm. Gelman and Rubin (1992) sug- 
gest that the first / simulations should be discarded 
and inference based on the last I simulations; for the 
J th chain these are denoted {Xij , X2j , X^^j , . . . , Xij } 
with J = 1, 2, . . . , m. Recall that we are interested in 
estimating E^^g and define Yij = g{Xij), 

1 in m 

B = ^—y{Y..-Y..f and W = -Y s^. 
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(a) CBMl, with a cutoff of £ ^ 0.06. 



(b) CBM2. with a cutoff of e - 0.04. 
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(c) GRDL 2 chains and 5 ^ 1.1. 



(d) GRD3, 2 chains and S - 1.005. 
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(e) GRD2, 4 chains and 5 - 1.1. 



(f) GRD4, 4 chains and 6 = 1.005. 



Fig. 1. Histograms from 1000 replications estimating E{ii\y) for the toy example of Section 3.2.1 with CBM and GRD. 
Simulation sample sizes are given in Table 2. 



where Y.j = l~^Y.\=iYij, Y.. = m-^J2"jl^Yj and and define the corrected potential scale reduction 
s2 = (/ _ i)"i J2i=i{Yij - Yj)^. Note that Y. is the factor 

resulting point estimate of ET^g. Let 



R: 



l d + 3 V 
d + lW' 



2V 



ml 



var(F) 



As noted by Gelman et al. (2004), V and W are 
essentially two different estimators of vaTj^ g; not a"^ 
from the Markov chain CLT. That is, neither V nor 
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W address the sampling variability of gn and hence 
neither does R. 

In our examples we used the R package coda which 
reports an upper bound on R. Specifically, a 97.5% 
upper bound for R is given by 



Ri 



■0.975 



ld + 3 
d+1 



l-l 



I 



+ Fo, 



975.m— l.ui 



m+1 B 
ml W 



m—l 
w 



where -Fb.975,m-i,w is the 97.5% percentile of an F, 
distribution, w = 2W'^ /a^i and 

m 

In order to stop the simulation the user provides a 
cutoff, 5 > 0, and simulation continues until 



(8) 



^0.975 +p{n) < 5. 



As with fixed-width methods, the role of p{n) is to 
ensure that we do not stop the simulation prema- 
turely due to a poor estimate, -Ro.975- By requiring 
a minimum total simulation effort of n* = 2lm we 
are effectively setting p{n) = 51 {n < n*) where n in- 
dexes the total simulation effort. 

4.2.1 Remarks. 

1. A rule of thumb suggested by Gelman et al. (2004) 
is to set (5 = 1.1. These authors also suggest that 
a value of 5 closer to 1 will be desirable in a "final 
analysis in a critical problem" but give no further 
guidance. Since neither R nor -Ro.975 directly es- 
timates the Monte Carlo error in Qn it is unclear 
to us that i? ~ 1 implies (jn is a good estimate of 

2. How large should m be? There seem to be few 
guidelines in the literature except that m > 2 
since otherwise we cannot calculate B. Clearly, 
if m is too large then each chain will be too short 
to achieve any reasonable expectation of conver- 
gence within a given computational effort. 

3. The initial values, Xji, of the m parallel chains 
should be drawn from an "overdispersed" distri- 
bution. Gelman and Rubin (1992) suggest esti- 
mating the modes of vr and then using a mixture 
distribution whose components are centered at 
these modes. Constructing this distribution could 
be difficult and is often not done in practice (Gel- 
man et al., 2004, page 593). 

4. To our knowledge there has been no discussion in 
the literature about optimal choices of p(n) or n* . 
In particular, we know of no guidance about how 



long each of the parallel chains should be simu- 
lated before the first time we check that -Ro.975 < 
5 or how often one should check after that. How- 
ever, the same theoretical results that could give 
guidance in item 3 of Section 4.1.1 would apply 
here as well. 

5. GRD was originally introduced simply as a method 
for determining an appropriate amount of burn- 
in. However, using diagnostics in this manner may 
introduce additional bias into the results; see 
Cowles, Roberts and Rosenthal (1999). 

4.2.2 Toy example. We consider implementation 
of GRD in the toy example introduced in Section 3.2.1. 
The first issue we face is choosing the starting values 
for each of the m parallel chains. Notice that 

7r(Ai, A|y) =gi(/i|A)c/2(A) 

where gi{fi\X) is a N{y,X/K) density and 52(A) is 
an IG{{K - 2) /2, {K - l)s'^/2) density. Thus we can 
sequentially sample the exact distribution by first 
drawing from 52(A), and then conditionally, draw 
from gi{ji\X). We will use this to obtain starting val- 
ues for each of the m parallel chains. Thus each of 
the m parallel Markov chains will be stationary and 
hence GRD should be at a slight advantage com- 
pared to CBM started from y. 

Our goal is to investigate the finite-sample prop- 
erties of the GRD by considering the estimates of 
E{fi\y) and E{X\y) as in Section 4.1.2. To this end, 
we took multiple chains starting from different draws 
from the sequential sampler. The multiple chains 
were run until the total simulation effort was n* = 
400 draws; this is the same minimum simulation ef- 
fort we required of CBM in the previous section. 
If i2o.975 < ^ for both, the simulation was stopped. 
Otherwise, 10% of the current chain length was added 
to each chain before -Ro.975 was recalculated. This 
continued until -Ro.975 was below 5 for both. This 
simulation procedure was repeated independently 
1000 times with each replication using the same ini- 
tial values. We considered four settings using the 
combinations of m G {2,4} and 6 G {1.005, 1.1}. These 
settings wiU be denoted GRDl, GRD2, GRD3 and 
GRD4; see Table 1 for the different settings along 
with summary statistics that will be considered later. 

Upon completion of each replication, the values of 
fin and Xn were recorded. Figure l(c)-(f) show his- 
tograms of fin for each setting. We can see that all 
the settings center around the true value of 1, and 
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setting 6 = 1.005 provides better estimates. Increas- 
ing the number of chains seems to have httle im- 
pact on the quahty of estimation, particularly when 
S = 1.1. Histograms of A„ for each setting show sim- 
ilar trends. 

In the settings we investigated, GRD often termi- 
nated the simulations much sooner than CBM. Ta- 
ble 2 shows the percentage of the 1000 replications 
which were stopped at their minimum (n* = 400) 
and the percentage with less than 1000 total draws. 
The data clearly show that premature stopping was 
common with GRD but uncommon with CBM. This 
is especially the case for GRDl and GRD2 where 
over half the replications used the minimum simu- 
lation effort. 



Also, the simulation effort for GRD was more vari- 
able than that of CBM. In particular, the average 
simulation effort was comparable for CBMl and 
GRD3 and also CBM2 and GRD4 but the stan- 
dard errors are larger for GRD. Next, Figure 2(a) 
and (b) show a plot of the estimates, versus the 
total number of draws in the chains for CBM2 and 
GRD4. The graphs clearly show that the total num- 
ber of draws was more variable using GRD4 than 
using CBM2. From a practical standpoint, this im- 
plies that when using GRD we are likely to run a 
simulation either too long or too short. Of course, 
if we run the simulation too long, we will be likely 
to get a better estimate. But if the simulation is too 
short, the estimate can be poor. 



Table 1 

Summary table for all settings and estimated mean-squared error for estimating E{fi\y) and E{X\y) for the toy example of 

Section 3.2.1 







Stopping 


MSE for 




MSE for 




Method 


Chains 


rule 


E{^Ji\y) 


S.E. 


E(\\y) 


S.E. 


CBMl 


1 


0.06 


9.82e-05 


4.7e-06 


1.03e-03 


4.5e-05 


CBM2 


1 


0.04 


3.73e-05 


1.8e-06 


3.93e-04 


1.8e-05 


GRDl 


2 


1.1 


7.99e-04 


3.6e-05 


8.7e-03 


4e-04 


GRD 2 


4 


1.1 


7.79e-04 


3.7e-05 


8.21e-03 


3.6e-04 


GRD3 


2 


1.005 


3.49e-04 


2.1e-05 


3.68e-03 


2e-04 


GRD4 


4 


1.005 


1.34e-04 


9.2e-06 


1.65e-03 


1.2e-04 



Standard errors (S.E.) shown for each estimate. 



CBM2 



GRIM 





10000 aoooo 30000 

Number of Draws 



40000 



lOOOC 



aoooo 30O00 
Mutntor o1 Draws 



40000 



(a) /in VS. n for CBM2 



(b) iin vs. n for GRD4 



Fig. 2. Estimating E{fi\y) for the toy example of Section 3.2.1. Estimates of E{^,\y) versus number of draws for the CBM2 
and GRD4 settings. 
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Table 2 

Summary of the proportion ( and standard error) of the 
observed estimates which were based on the minimum 
number (400) of draws, less than or equal to 1000 draws, 
and the average total simulation effort for the toy example in 
Section 3.2.1. 





Prop. 




Prop. 








Method 


at min. 


S.E. 


<1000 


S.E. 


N 


S.E. 


CBMl 








0.011 


0.0033 


2191 


19.9 


CBM2 














5123 


33.2 


GRDl 


0.576 


0.016 


0.987 


0.0036 


469 


4.1 


GRD2 


0.587 


0.016 


0.993 


0.0026 


471 


4.2 


GRD3 


0.062 


0.0076 


0.363 


0.015 


2300 


83.5 


GRD4 


0.01 


0.0031 


0.083 


0.0087 


5365 


150.5 



Now we compare GRD and CBM in terms of the 
quality of estimation. Table 1 gives the estimated 
mean-squared error (MSE) for each setting based on 
1000 independent replications described above. The 
estimates for GRD were obtained using the methods 
described earlier in this subsection while the results 
for CBM were obtained from the simulations per- 
formed for Section 4.1.2. It is clear that CBM results 
in superior estimation. In particular, note that us- 
ing the setting CBMl results in better estimates of 
E{fi\y) and E{X\y) than using setting GRD4 while 
using approximately half the average simulation ef- 
fort [2191 (19.9) versus 5365 (150.5)]; see Table 2. 

Consider GRD4 and CBM2. Note that these two 
settings have comparable average simulation effort. 
The MSE for using GRD was 0.000134 (s.e. = 
9.2 X 10"^) and for CBM we observed an MSE of 
0.0000373 (1.8 X 10"*^). Now consider A„. The MSE 
based on using GRD was 0.00165 (1.2 x 10~^) and 
for CBM we observed an MSE of 0.000393 (1.8 x 
10~^). Certainly, the more variable simulation effort 
of GRD contributes to this difference but so does the 
default use of burn-in. 

Recall that we employed a sequential sampler to 
draw from the target distribution implying that the 
Markov chain is stationary and hence burn-in is un- 
necessary. To understand the effect of using burn-in 
we calculated the estimates of E{fi\y) using the en- 
tire simulation; that is, we did not discard the first / 
draws of each of the m parallel chains. This yields an 
estimated MSE of 0.0000709 (4.8 x lO"*^) for GRD4. 
Thus, the estimates using GRD4 still have an es- 
timated MSE 1.9 times larger than that obtained 
using CBM2. The standard errors of the MSE es- 
timates show that this difference is still significant, 
indicating CBM, in terms of MSE, is still a superior 



method for estimating E{fi\y). Similarly, for esti- 
mating E{X\y) the MSE using GRD4 without dis- 
carding the first half of each chain is 2.1 higher than 
that of CBM2. 

Toy examples are useful for illustration; however, 
it is sometimes difficult to know just how much cre- 
dence the resulting claims should be given. For this 
reason, we turn our attention to a setting that is 
"realistic" in the sense that it is similar to the type 
of setting encountered in practice. Specifically, we 
do not know the true values of the posterior ex- 
pectations and implementing a reasonable MCMC 
strategy is not easy. Moreover, we do not know the 
convergence rate of the associated Markov chain. 

5. A HIERARCHICAL MODEL FOR 
GEOSTATISTICS 

We consider a data set on wheat crop flowering 
dates in the state of North Dakota (Haran et al., 
2007). These data consist of experts' model-based 
estimates for the dates when wheat crops flower at 
365 different locations across the state. Let D be 
the set of N sites and the estimate for the flowering 
date at site s be Z{s) for s £ D. Let X{s) be the 
latitude for s (z D. The flowering dates are generally 
expected to be later in the year as X{s) increases 
so we assume that the expected value for Z(s) in- 
creases linearly with X{s). The flowering dates are 
also assumed to be spatially dependent, suggesting 
the following hierarchical model: 

Z{s)\p,^{s)=X{s)^ + ^{s) for seD, 

e I r2,a2,(/.~iV(0,S(r2,a2, </))), 

where C = iCisi), ■ ■ ■ ,C{sN)f with ^(r^, a^, (^) = 
T^I + a^H{(l)) and {i/((/>)}i, = exp((-||si - s,||)/</.), 
the exponential correlation function. We complete 
the specification of the model with priors on r^, cr^, 

(f), and f3, 

r2~IG(2,30), ~IG(0.1,30), 
(/) ~ Log-Unif(0.6, 6), 7r(/?) oc 1. 

Setting Z = {Z{si), . . . , Z{sn)), inference is based 
on the posterior distribution 7r(r^, cr^, 0, P\ Z). Note 
that MCMC may be required since the integrals 
required for inference are analytically intractable. 
Also, samples from this posterior distribution can 
then be used for prediction at any location s € D. 

Consider estimating the posterior expectation of 
T'^,a'^,(f) and (3. Unlike the toy example considered 
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earlier, these expectations are not analytically avail- 
able. Sampling from the posterior is accomplished 
via a Metropolis-Hastings sampler with a joint up- 
date for the r^, 4>, (i via a three-dimensional inde- 
pendent Normal proposal centered at the current 
state with a variance of 0.3 for each component and 
a univariate Gibbs update for o"^. 

To obtain a high-quality approximation to the de- 
sired posterior expectations we used a single long 
run of 500,000 iterations of the sampler and ob- 
tained 23.23 (0.0426), 25.82 (0.0200), 2.17 (0.0069) 
and 4.09 (4.3e-5) as estimates of the posterior ex- 
pectations of r^, o"^, 4> and /3, respectively. These 
are assumed to be the truth. We also recorded the 
10th, 30th, 70th and 90th percentiles of this long 
run for each parameter. 

Our goal is to compare the finite-sample proper- 
ties of GRD and CBM in terms of quality of esti- 
mation and overall simulation effort. Consider im- 
plementation of GRD. We will produce 100 inde- 
pendent replications using the following procedure. 
For each replication we used m = 4 parallel chains 
from four different starting values corresponding to 
the 10th, 30th, 70th and 90th percentiles recorded 
above. A minimum total simulation effort of 1000 
(250 per chain) was required. Also, no burn-in was 
employed. This is consistent with our finding in the 
toy example that estimation improved without using 
burn-in. Each replication continued until -Ro.975 ^ 
1.1 for all of the parameter estimates. Estimates of 
the posterior expectations were obtained by averag- 
ing draws across all four parallel chains. 

Now consider the implementation of CBM. For the 
purpose of easy comparison with GRD, we ran a to- 
tal of 400 independent replications of our MCMC 
sampler, where the 10th, 30th, 70th and 90th per- 
centiles of the parameter samples from the long run 
were used as starting values for 100 replications each. 
Each replication was simulated for a minimum of 
1000 iterations so p(n) = el{n < 1000). Thus the 
minimum simulation effort is the same as that for 
GRD. Using (7), a single replication (chain) was ter- 
minated when each of the half-widths of a 95% in- 
terval was smaller than 0.5, 0.5, 0.05 and 0.05 for the 
estimates of the posterior expectations of r^, a^, (p 
and (3, respectively. These thresholds correspond to 
reasonable desired accuracies for the parameters. If 
the half-width was not less than the cutoff, then 10 
iterations were added to the chain before checking 
again. 



Table 3 

Summary of estimated mean-squared error obtained using 
CBM and GRD for the model of Section 5 



Method 
Parameter 




GRD 


CBM 


MSE 


S.E. 


MSE 


S.E. 




0.201 


0.0408 


0.0269 


0.00185 


E(a'\z) 


0.0699 


0.0179 


0.00561 


0.00039 


E{4>\z) 


0.00429 


0.00061 


0.000875 


5.76e-05 


E{P\z) 


1.7e-07 


3.09e-08 


3.04e-08 


1.89e-09 



Standard errors (S.E.) shown for each estimate. 



The results from our simulation study are sum- 
marized in Table 3. Clearly, the MSE for estimates 
using GRD are significantly higher than the MSE 
for estimates obtained using CBM. However, CBM 
required a greater average simulation effort 31,568.9 
(177.73) than did GRD 8,082 (525.7). To study 
whether the CBM stopping rule delivered confidence 
intervals at the desired 95% levels, we also estimated 
the coverage probabilities for the intervals for the 
posterior expectations of r^, a^, cj) and /3, which 
were 0.948 (0.0112), 0.945 (0.0114), 0.912 (0.0141) 
and 0.953 (0.0106), respectively. The coverage for all 
parameters is fairly close to the desired 95%. 

Finally, we note that this simulation study was 
conducted on a Linux cluster using R (Ihaka and 
Gentleman (1996)), an MCMC package for spatial 
modeling, spBayes (Finley, Banerjee and Carlin 
(2007)) and the parallel random number generator 
package rlecuyer (L'Ecuyer et al., 2002). 

6. DISCUSSION 

In our view, the point of this paper is that those 
examining the results of MCMC computations are 
much better off when reliable techniques are used 
to estimate MCSEs and then the MCSEs are re- 
ported. An MCSE provides two desirable proper- 
ties: (1) It gives useful information about the qual- 
ity of the subsequent estimation and inference; and 
(2) it provides a theoretically justified, yet easily 
implemented, approach for determining appropriate 
stopping rules for their MCMC runs. On the other 
hand, a claim that a test indicated the sampler "con- 
verged" is simply nowhere near enough information 
to objectively judge the quality of the subsequent 
estimation and inference. Discarding a set of initial 
draws does not necessarily improve the situation. 

A key requirement for reporting valid Monte Carlo 
standard errors is that the sampler mixes well. Find- 
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ing a good sampler is likely to be the most challeng- 
ing part of the recipe we describe. We have given no 
guidance on this other than one should look within 
the class of geometrically ergodic Markov chains if 
at all possible. This is an important matter in any 
MCMC setting; that is, a Markov chain that con- 
verges quickly is key to obtaining effective simu- 
lation results in finite time. Thus there is still a 
great deal of room for creativity and research in im- 
proving samplers but there are already many useful 
methods that can be implemented for difficult prob- 
lems. For example, one of our favorite techniques is 
simulated tempering (Geyer and Thompson (1995), 
Marinari and Parisi (1992)) but many others are pos- 
sible. 
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